Derivatives with respect to matrices

What is the gradient of a scalar function with respect to a matrix argument?

Remember that the gradient of a scalar-valued function with respect to a vector argument is a vector of the same size; similarly, the derivative of a scalar with respect to a matrix is a matrix of the same size.

The trace (the sum of the diagonal elements) appears often in scalar-valued functions of matrices, so we start with its definition and an example: $\newcommand{\trace}{\mathop{\text{Tr}}}$ \begin{eqnarray} \trace X = \sum_i X(i,i) \end{eqnarray}

\begin{eqnarray} f(X) & = & \trace \left( \begin{array}{cc} X_{1,1} & X_{1,2} \\ X_{2,1} & X_{2,2} \end{array} \right) = X_{1,1} + X_{2,2} \\ \frac{df}{dX} & = & \left( \begin{array}{cc} \partial{f}/\partial X_{1,1} & \partial{f}/\partial X_{1,2} \\ \partial{f}/\partial X_{2,1} & \partial{f}/\partial X_{2,2} \end{array} \right) = \left( \begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array} \right) \end{eqnarray}
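As a sanity check, here is a minimal numerical sketch, assuming numpy is available; the finite-difference helper `num_grad` is our own illustrative function, not a library routine.

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at matrix X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

X = np.random.randn(2, 2)
print(num_grad(np.trace, X))  # approximately the 2x2 identity matrix
```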

The trace is like the inner product of the matrix $X$ (viewed as a vector) and the identity matrix (viewed as a vector). In the following, we will introduce a systematic way of computing these derivatives, where we will make use of the Kronecker delta symbol, defined as $$ \delta(i, j) = \delta(j, i) = \left\{ \begin{array}{cc} 1 & i = j \\0 & i\neq j \end{array} \right. $$

Before delving into the business of computing the derivatives, let us warm up a little. An important property of the trace is that the factors of a matrix product can be rotated (cyclically permuted) inside a trace: \begin{eqnarray} \trace A X = \trace X A \end{eqnarray}

Using the Kronecker delta, the trace operator can be written as \begin{eqnarray} \trace X = \sum_i \sum_j X(i,j) \delta(i, j) \end{eqnarray}

To prove this result, we will write the product explicitly using the index notation \begin{eqnarray} (A X)(i,j) &=& \sum_k A(i, k) X(k, j) \end{eqnarray} \begin{eqnarray} \trace A X & = & \sum_i \sum_j \left(\sum_k A(i, k) X(k, j) \right) \delta(i, j) \\ & = & \sum_k \sum_i A(i, k) X(k, i) \\ & = & \sum_k \sum_i X(k, i) A(i, k) \\ & = & \sum_k \sum_r \sum_i X(k, i) A(i, r) \delta(r,k) \\ & = & \sum_k \sum_r (X A)(k, r) \delta(r,k) = \trace X A \end{eqnarray}
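This property is easy to verify numerically; a minimal sketch assuming numpy, with deliberately rectangular factors:

```python
import numpy as np

A = np.random.randn(3, 2)
X = np.random.randn(2, 3)
# A @ X is 3x3 while X @ A is 2x2, yet their traces agree
print(np.trace(A @ X), np.trace(X @ A))
```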

Let's now focus on the derivative of the trace of a product. We define \begin{eqnarray} f(X) = \trace A X & = & \sum_i \sum_j \left(\sum_k A(i, k) X(k, j) \right) \delta(i, j) \end{eqnarray} where $A$ is $I \times K$ and $X$ is $K \times I$. Note that the product inside the trace must be a square matrix; otherwise the trace is not defined.

The gradient with respect to $X$ will be an object of the same size as $X$, with each entry defined by $$ \frac{d f}{d X} = \left[\frac{\partial f(X)}{\partial X(u, r)}\right] $$ Formally, since the entries of $X$ are independent variables, we have \begin{eqnarray} \frac{\partial X(k, j)}{\partial X(u, r)} & = & \delta(u, k) \delta(r, j) \end{eqnarray}

We can now sum over the indices to remove the Kronecker delta symbols \begin{eqnarray} \frac{\partial f(X)}{\partial X(u, r)} & = & \sum_i \sum_j \sum_k A(i, k) \delta(u, k) \delta(r, j) \delta(i, j) \\ & = & \sum_i \sum_j A(i, u) \delta(r, j) \delta(i, j) \\ & = & \sum_j A(j, u) \delta(r, j) = A(r, u) = (A^\top)(u, r) \\ \end{eqnarray}

We have just shown that \begin{eqnarray} \frac{d f}{d X} = A^\top \end{eqnarray}
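A quick finite-difference check of this result; a sketch assuming numpy, where the function name `f` and the matrix sizes are arbitrary illustration choices:

```python
import numpy as np

A = np.random.randn(3, 2)
X = np.random.randn(2, 3)

def f(X):
    return np.trace(A @ X)

# central finite differences, one entry of X at a time
eps = 1e-6
G = np.zeros_like(X)
for idx in np.ndindex(*X.shape):
    E = np.zeros_like(X)
    E[idx] = eps
    G[idx] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.allclose(G, A.T))  # True
```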

Example: Derivative of $\trace A^\top X A$

We will use the results above \begin{eqnarray} f(X) & = & \trace A^\top X A = \trace A A^\top X \\ \frac{d f}{d X} &= & (A A^\top)^\top = A A^\top \end{eqnarray}

Now, this quantity is new: \begin{eqnarray} f(X) & = & \trace A^\top X A = \trace A A^\top X \\ \frac{d f}{d A} &= & ? \end{eqnarray}

\begin{eqnarray} f(A) = \trace A^\top X A & = & \sum_i \sum_j \left(\sum_k \sum_l (A^\top)(i, k) X(k, l) A(l, j) \right) \delta(i, j) \\ \frac{d f}{d A} &= & \left[\frac{\partial f(A)}{\partial A(u, r)}\right] \\ \frac{\partial f}{\partial A(u,r)} & = & \frac{\partial}{\partial A(u,r)} \sum_i \sum_j \left(\sum_k \sum_l A(k, i) X(k, l) A(l, j) \right) \delta(i, j) \\ & = & \sum_i \sum_j \sum_k \sum_l \delta(k, u) \delta(i, r) X(k, l) A(l, j) \delta(i, j) \\ & & + \sum_i \sum_j \sum_k \sum_l A(k, i) X(k, l) \delta(l, u) \delta(j, r) \delta(i, j) \\ & = & \sum_i \sum_j \sum_l \delta(i, r) X(u, l) A(l, j) \delta(i, j) \\ & & + \sum_i \sum_j \sum_k A(k, i) X(k, u) \delta(j, r) \delta(i, j) \\ & = & \sum_j \sum_l X(u, l) A(l, j) \delta(r, j) + \sum_j \sum_k A(k, j) X(k, u) \delta(j, r) \\ & = & \sum_l X(u, l) A(l, r) + \sum_k A(k, r) X(k, u) \\ & = & \sum_l X(u, l) A(l, r) + \sum_k (X^\top)(u,k) A(k, r) \\ & = & (X A)(u, r) + (X^\top A)(u,r) = \left((X+X^\top) A\right)(u, r) \end{eqnarray} so that $$ \frac{d f}{d A} = (X+X^\top) A $$
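Both results of this example, the derivative with respect to $X$ and the one with respect to $A$, can be checked numerically. A sketch assuming numpy, where `num_grad` is an illustrative finite-difference helper of our own:

```python
import numpy as np

def num_grad(f, Y, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at matrix Y."""
    G = np.zeros_like(Y)
    for idx in np.ndindex(*Y.shape):
        E = np.zeros_like(Y)
        E[idx] = eps
        G[idx] = (f(Y + E) - f(Y - E)) / (2 * eps)
    return G

X = np.random.randn(4, 4)  # X must be square for Tr(A^T X A)
A = np.random.randn(4, 3)

# d Tr(A^T X A) / dX = A A^T
print(np.allclose(num_grad(lambda X_: np.trace(A.T @ X_ @ A), X), A @ A.T))
# d Tr(A^T X A) / dA = (X + X^T) A
print(np.allclose(num_grad(lambda A_: np.trace(A_.T @ X @ A_), A), (X + X.T) @ A))
```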

Example: Derivative of $\trace A X A^\top$

This looks quite similar:

\begin{eqnarray} f(A) & = & \trace A X A^\top = \sum_i \sum_j \left(\sum_k \sum_l A (i, k) X(k, l) (A^\top)(l, j) \right) \delta(i, j) \\ & = & \sum_i \sum_j \sum_k \sum_l A (i, k) X(k, l) A(j, l) \delta(i, j) \\ \frac{\partial f}{\partial A(u,r)} & = & \sum_i \sum_j \sum_k \sum_l \delta(u,i) \delta(r,k) X(k, l) A(j, l) \delta(i, j) + \sum_i \sum_j \sum_k \sum_l A (i, k) X(k, l) \delta(u,j) \delta(r,l) \delta(i, j) \\ & = & \sum_l X(r, l) A(u, l) + \sum_k A (u, k) X(k, r) \\ & = & \sum_l A(u, l) (X^\top)(l, r) + \sum_k A (u, k) X(k, r) \\ & = & (A X^\top)(u,r) + (A X)(u, r) \end{eqnarray}

So the derivative is $$ \frac{d f(A)}{d A} = A (X^\top + X) $$
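Again, a finite-difference sketch (assuming numpy; the sizes are arbitrary) confirms the result:

```python
import numpy as np

A = np.random.randn(4, 3)
X = np.random.randn(3, 3)

def f(A_):
    return np.trace(A_ @ X @ A_.T)

# central finite differences, one entry of A at a time
eps = 1e-6
G = np.zeros_like(A)
for idx in np.ndindex(*A.shape):
    E = np.zeros_like(A)
    E[idx] = eps
    G[idx] = (f(A + E) - f(A - E)) / (2 * eps)

print(np.allclose(G, A @ (X.T + X)))  # True
```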

Matrix Factorization by Alternating Least Squares (ALS)

We define the Frobenius norm $$ \|E\|_F = \sqrt{\trace E^\top E} $$
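This definition matches numpy's built-in Frobenius norm; a one-line sanity check, assuming numpy:

```python
import numpy as np

E = np.random.randn(5, 3)
print(np.sqrt(np.trace(E.T @ E)))  # the definition above
print(np.linalg.norm(E, 'fro'))    # numpy's built-in Frobenius norm; same value
```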

Using the results we have just derived, we can compute the derivative of the squared Frobenius norm of the error matrix $E = X - M C$ of a factorization $X \approx M C$:

\begin{eqnarray} E^\top E &= & (X - M C)^\top (X - M C) \\ & = & X^\top X + C^\top M^\top M C - X^\top M C - C^\top M^\top X \\ \|E\|_F^2 & = & \trace \left( X^\top X + C^\top M^\top M C - 2 X^\top M C \right) \end{eqnarray} where the two cross terms combine because $\trace B^\top = \trace B$.

By the linearity of the trace and the results above, \begin{eqnarray} \|E\|_F^2 & = & \trace \left( X^\top X + C^\top M^\top M C - 2 X^\top M C \right) \\ \frac{d\|E\|_F^2}{d C} & = & 2 M^\top M C - 2 M^\top X \end{eqnarray} Setting this gradient to zero gives the update \begin{eqnarray} C & = & (M^\top M)^{-1} M^\top X \end{eqnarray}
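This update is just an ordinary least squares solve applied column by column; a minimal check assuming numpy, with arbitrary sizes:

```python
import numpy as np

X = np.random.randn(10, 7)
M = np.random.randn(10, 3)

C = np.linalg.solve(M.T @ M, M.T @ X)           # C = (M^T M)^{-1} M^T X
C_lstsq = np.linalg.lstsq(M, X, rcond=None)[0]  # column-wise least squares
print(np.allclose(C, C_lstsq))                  # True
```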

For the gradient with respect to $M$, we first rotate the traces so that each term ends with $M$: \begin{eqnarray} \|E\|_F^2 & = & \trace X^\top X + \trace C^\top M^\top M C - 2 \trace X^\top M C \\ & = & \trace X^\top X + \trace M C C^\top M^\top - 2 \trace C X^\top M \\ \frac{d\|E\|_F^2}{d M} & = & 2 M C C^\top - 2 X C^\top \end{eqnarray} Setting this gradient to zero gives \begin{eqnarray} M & = & X C^\top (C C^\top)^{-1} \end{eqnarray}

Equivalently, with the generic naming $X \approx A B$, the two alternating updates read

$B = (A^\top A)^{-1} A^\top X$

$A = X B^\top (B B^\top)^{-1}$
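Putting the two updates together gives the ALS iteration. Below is a minimal sketch assuming numpy; the data, the rank, and the iteration count are arbitrary illustration choices, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 15))    # data matrix to factorize as X ~ M C
rank = 4
M = rng.standard_normal((20, rank))
C = rng.standard_normal((rank, 15))

for _ in range(50):
    # closed-form updates derived above
    C = np.linalg.solve(M.T @ M, M.T @ X)     # C = (M^T M)^{-1} M^T X
    M = np.linalg.solve(C @ C.T, C @ X.T).T   # M = X C^T (C C^T)^{-1}
    err = np.linalg.norm(X - M @ C, 'fro') ** 2

print(err)  # the squared Frobenius error is non-increasing across iterations
```

Each update solves a linear system rather than forming an explicit inverse, which is the numerically preferred way to apply $(M^\top M)^{-1}$ and $(C C^\top)^{-1}$.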